feat: support json index #36750

Open: sunby wants to merge 10 commits into master

@sunby sunby commented Oct 10, 2024

This PR adds JSON index support for JSON fields and dynamic fields. For now, the index only serves unary filter expressions such as 'a["b"] > 1'. Support for more filter types will be added later.

Basic usage:

    collection.create_index("json_field", {
        "index_type": "INVERTED",
        "params": {"json_cast_type": DataType.STRING,
                   "json_path": 'json_field["a"]["b"]'},
    })

This index has some limitations:

  1. If a record does not contain the JSON path you specify, it is skipped silently; no error is raised.
  2. If the value at the JSON path cannot be cast to the type you specify, it is skipped silently; no error is raised.
  3. A given JSON path can have only one JSON index.
  4. If you try to create more than one JSON index on the same JSON field, the SDK (pymilvus<=2.4.7) may return immediately because of its internal implementation. This will be fixed in a later version.

@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sunby
To complete the pull request process, please assign xiaofan-luan after the PR has been reviewed.
You can assign the PR to them by writing /assign @xiaofan-luan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot added the size/XL Denotes a PR that changes 500-999 lines. label Oct 10, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/feature Issues related to feature request from users labels Oct 10, 2024
mergify bot commented Oct 10, 2024

@sunby Please associate the related issue to the body of your Pull Request. (eg. “issue: #”)

mergify bot commented Oct 10, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Oct 10, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify bot commented Oct 12, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

@sre-ci-robot sre-ci-robot added size/XXL Denotes a PR that changes 1000+ lines. and removed size/XL Denotes a PR that changes 500-999 lines. labels Oct 12, 2024
@@ -201,6 +202,16 @@ class SegmentInternalInterface : public SegmentInterface {
return *ptr;
}

template <typename T>
const index::ScalarIndex<T>&
chunk_scalar_index(std::string path, int64_t chunk_id) const {

The `path` argument, passed in as `nested_path`, may be duplicated across two different fields; we should add a `fieldId` argument here.
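
To illustrate the concern: if two JSON fields both contain the same nested path, a lookup keyed by path alone cannot tell their indexes apart. The sketch below is a hypothetical Python model (the real code is C++); the registry classes, the field IDs, and the `/a/b` path string are invented for illustration, and it assumes the nested path does not itself encode the field.

```python
class PathOnlyRegistry:
    """Indexes keyed by nested path only; a second field overwrites the first."""

    def __init__(self):
        self._indexes = {}

    def add(self, field_id, path, index):
        self._indexes[path] = index  # field_id is ignored: collision risk

    def get(self, field_id, path):
        return self._indexes[path]


class FieldAwareRegistry:
    """Indexes keyed by (field_id, path), as the review comment suggests."""

    def __init__(self):
        self._indexes = {}

    def add(self, field_id, path, index):
        self._indexes[(field_id, path)] = index

    def get(self, field_id, path):
        return self._indexes[(field_id, path)]


naive, fixed = PathOnlyRegistry(), FieldAwareRegistry()
for registry in (naive, fixed):
    registry.add(101, "/a/b", "index-for-field-101")
    registry.add(102, "/a/b", "index-for-field-102")

collided = naive.get(101, "/a/b")  # the index built for field 102
resolved = fixed.get(101, "/a/b")  # the index built for field 101
```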

}

// check duplicate json path
exists := s.meta.indexMeta.GetFieldIndexes(req.GetCollectionID(), req.GetFieldID(), req.GetIndexName())

Is it allowed to create duplicate indexes with the same JSON path but different index names?
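
The question matters because the PR description states that a given JSON path can have only one JSON index. One way to enforce that is to compare on field and JSON path while ignoring the index name. This is a sketch under that assumption, not the datacoord code: `IndexMeta`, `DuplicateJsonPathError`, and `check_duplicate_json_path` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    collection_id: int
    field_id: int
    index_name: str
    json_path: str


class DuplicateJsonPathError(ValueError):
    pass


def check_duplicate_json_path(existing, request):
    """Reject `request` if an existing index covers the same field and JSON path.

    The index name is deliberately left out of the comparison, so choosing a
    new name does not bypass the one-index-per-path rule.
    """
    for meta in existing:
        same_target = (
            meta.collection_id == request.collection_id
            and meta.field_id == request.field_id
            and meta.json_path == request.json_path
        )
        if same_target:
            raise DuplicateJsonPathError(
                f"json path {request.json_path!r} is already indexed by {meta.index_name!r}"
            )


existing = [IndexMeta(1, 101, "idx_ab", 'json_field["a"]["b"]')]

# Same path under a different index name: rejected.
try:
    check_duplicate_json_path(
        existing, IndexMeta(1, 101, "idx_other", 'json_field["a"]["b"]')
    )
    rejected = False
except DuplicateJsonPathError:
    rejected = True

# A different path on the same field: accepted (no exception).
check_duplicate_json_path(existing, IndexMeta(1, 101, "idx_ac", 'json_field["a"]["c"]'))
```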

return nil
}

func (s *Server) parseNestedPath(identifier string, schema *schemapb.CollectionSchema) (string, error) {

Can we document the format of `identifier`?

PS: the code would be much easier to read if the parsing were replaced with a regex match.

Contributor Author

> Can we document the format of `identifier`?
>
> PS: the code would be much easier to read if the parsing were replaced with a regex match.

It's copied from the parser implementation.
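
For reference, a regex-based parser of the kind suggested might look like the Python sketch below. It assumes the identifier has the form shown in the PR description, a field name followed by zero or more bracketed keys (`json_field["a"]["b"]`), and that the nested path is rendered as a `/`-separated string with JSON Pointer style escaping. Both the accepted grammar and the output format are assumptions, not taken from `parseNestedPath`.

```python
import re

# A field name, then zero or more ["key"], ['key'] or [0] segments.
_IDENTIFIER = re.compile(
    r"""(?P<field>[A-Za-z_]\w*)(?P<rest>(?:\[(?:"[^"]*"|'[^']*'|\d+)\])*)"""
)
_SEGMENT = re.compile(r"""\[(?:"([^"]*)"|'([^']*)'|(\d+))\]""")


def parse_identifier(identifier):
    """Split `identifier` into (field_name, nested_path)."""
    match = _IDENTIFIER.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"malformed identifier: {identifier!r}")
    # Exactly one capture group participates per segment; lastindex selects it.
    keys = [m.group(m.lastindex) for m in _SEGMENT.finditer(match.group("rest"))]
    # Escape "~" and "/" so keys containing them stay unambiguous.
    escaped = [k.replace("~", "~0").replace("/", "~1") for k in keys]
    nested_path = "".join("/" + k for k in escaped)
    return match.group("field"), nested_path


field, nested = parse_identifier('json_field["a"]["b"]')
```

A single anchored pattern also doubles as documentation of the accepted format, which addresses the first half of the review comment.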

if err != nil {
log.Warn("failed to load index for segment", zap.Error(err))
return err
if info.IsJson {

Do we need to update the `getResourceUsageEstimateOfSegment` function to account for the number and size of JSON indexes?

Contributor Author

> Do we need to update the `getResourceUsageEstimateOfSegment` function to account for the number and size of JSON indexes?

We use the inverted index to implement JSON indexes, so we can reuse the memory estimation of the inverted index.

mergify bot commented Oct 31, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 53.30396% with 106 lines in your changes missing coverage. Please review.

Project coverage is 67.19%. Comparing base (7cfd609) to head (43e288f).
Report is 12 commits behind head on master.

Current head 43e288f differs from pull request most recent head b6add46

Please upload reports for the commit b6add46 to get more accurate results.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/core/src/index/IndexFactory.cpp | 0.00% | 34 Missing ⚠️ |
| ...core/src/indexbuilder/JsonInvertedIndexCreator.cpp | 0.00% | 21 Missing ⚠️ |
| internal/core/src/exec/expression/UnaryExpr.cpp | 59.52% | 17 Missing ⚠️ |
| ...rnal/core/src/segcore/ChunkedSegmentSealedImpl.cpp | 0.00% | 11 Missing ⚠️ |
| internal/core/src/index/InvertedIndexTantivy.h | 76.47% | 4 Missing ⚠️ |
| internal/core/src/segcore/SegmentInterface.h | 55.55% | 4 Missing ⚠️ |
| internal/core/src/segcore/load_index_c.cpp | 0.00% | 4 Missing ⚠️ |
| internal/core/src/index/InvertedIndexTantivy.cpp | 0.00% | 3 Missing ⚠️ |
| ...l/core/src/indexbuilder/JsonInvertedIndexCreator.h | 84.21% | 3 Missing ⚠️ |
| ...ernal/core/src/indexbuilder/ScalarIndexCreator.cpp | 25.00% | 3 Missing ⚠️ |

... and 1 more

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #36750       +/-   ##
===========================================
- Coverage   83.23%   67.19%   -16.05%     
===========================================
  Files        1015      292      -723     
  Lines      157697    25578   -132119     
===========================================
- Hits       131260    17186   -114074     
+ Misses      21228     8392    -12836     
+ Partials     5209        0     -5209     

| Components | Coverage Δ |
|---|---|
| Client | ∅ <ø> (∅) |
| Core | 67.19% <52.04%> (∅) |
| Go | ∅ <ø> (∅) |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/core/src/common/FieldDataInterface.h | 56.76% <100.00%> (ø) |
| internal/core/src/common/FieldMeta.h | 94.64% <ø> (ø) |
| ...e/src/exec/expression/BinaryArithOpEvalRangeExpr.h | 100.00% <100.00%> (ø) |
| ...nternal/core/src/exec/expression/BinaryRangeExpr.h | 92.68% <100.00%> (ø) |
| internal/core/src/exec/expression/ExistsExpr.h | 100.00% <100.00%> (ø) |
| internal/core/src/exec/expression/Expr.h | 69.07% <100.00%> (ø) |
| ...ternal/core/src/exec/expression/JsonContainsExpr.h | 100.00% <100.00%> (ø) |
| internal/core/src/exec/expression/TermExpr.h | 85.71% <100.00%> (ø) |
| internal/core/src/exec/expression/UnaryExpr.h | 77.77% <100.00%> (ø) |
| internal/core/src/index/IndexFactory.h | 100.00% <ø> (ø) |

... and 14 more

... and 1283 files with indirect coverage changes

mergify bot commented Oct 31, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@xiaofan-luan
Collaborator

@sunby

Are you done with this feature so I can start to review it?

mergify bot commented Nov 4, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify bot commented Nov 4, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 4, 2024

@sunby cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify bot commented Nov 5, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify bot commented Nov 5, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 5, 2024

@sunby cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 6, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify bot commented Nov 7, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: sunby <[email protected]>
mergify bot commented Nov 7, 2024

@sunby cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify bot commented Nov 7, 2024

@sunby E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Labels
area/compilation, area/internal-api, area/test, dco-passed (DCO check passed), do-not-merge/missing-related-issue, kind/feature (issues related to feature requests from users), sig/testing, size/XXL (denotes a PR that changes 1000+ lines)
4 participants